Abstract:
The invention relates to a method and an apparatus for a personal voice dialler system, and more particularly to a system and method that let a user dial a telephone number simply by saying the name of the called party. The novel method and apparatus use a receiver/dialler unit to selectively direct a voice path into communication with either the telephone network or a PVD management unit. The invention uses a speech recognition unit that selects from a speech recognition dictionary the entry most likely to match a spoken utterance. The invention makes use of a directory listing and a speechware builder unit to create the speech recognition dictionary. The invention allows a user to access a listing of names and phone numbers in a database by voice, with little or no training of the speech recognition process on the user's side.
Publication number: CA2256781A1
Application number: C2256781
Filing date: 1998-12-21
Publication date: 2000-03-14
Inventors: Randy Grant Fehr; Stephen Jay Copeland; Raymond J. Kenworthy; Daniel D. Ferrero; Fekri Henein
Applicant: Nortel Networks Corp
Main IPC class: H04M1-26
Description:
Title: Method and Apparatus for Automatically Dialling a Desired Telephone Number Using Speech Commands

Field of the invention

This invention relates to the field of automating the execution of certain actions by uttering a request. The invention is particularly well suited for managing communications (data and voice) and other control tasks such as effecting voice dialing, transferring calls, establishing conference calls, managing e-mail, retrieving web-based information, controlling house appliances using X10 commands and performing other speech-enabled services.

Background of the invention

Personal computing devices have become widespread household appliances and business tools. Users perform a number of tasks on these machines such as word processing, financial analysis, keeping track of inventory and so on. Personal computing devices are also an efficient way of keeping track of various information. Of particular interest is the use of personal computing devices to keep track of personal contacts. Using either regular text files or task-specific applications such as personal information managers (PIM), users can keep track of various information such as a person's name, address, telephone number, business number and so on. Personal contact managers allow users, when the personal computing device has a telephone connection, to select a name via a keyboard or pointing device and dial the number associated with the name to establish a connection. In a typical interaction, the user first opens the personal call manager application. The user
then locates in a database of entries the entity he wishes to contact. This is usually done by entering the name via a keyboard to start a search or by visually locating the entity on a display unit. The user then selects the phone number and initiates the call by entering a suitable command. Once the connection is made, the user may communicate with the called party via a microphone or telephone set.

Systems of the type described above require the user to locate the entry in the database and to select the phone number to be dialled. This is often inconvenient for the user, who is required to make multiple entries on a keyboard or use a pointing device for a single telephone call. Furthermore, when the database contains a large number of entries, locating the desired entry may become a tedious task or may require the user to type in the name of the entity sought. For example, a user who keeps contact data in a simple ASCII text file may have to scan the entire file before locating the correct number.

Another drawback of this approach is that the computing device requires a peripheral that can effect the dialling function. Typically, such a peripheral is a modem that can be used to effect data exchange operations through the telephone network. In addition to the cost of acquiring such a modem, the set-up is undesirable for a number of reasons. Since the modem is located in the telephone line path, it needs to process all the control signals that are exchanged between the Customer Premises Equipment (CPE), also called telephone instrument, and the switch of the telephone network. First, the modem may not be able to adequately process such signals, with the result that some
functionality of the CPE may no longer be available. Second, the permanent connection of the modem to the telephone line may create a security issue, since it may permit access to the files stored on the computing device through an external dial-up connection.

Thus, there exists a need in the industry to improve the process of automatically dialling a desired number from a set of contacts using a personal computing device.

Objects and Statement of the Invention

An object of the invention is to provide a computer readable storage medium containing a novel program element that can assist the establishment of a telephone connection by uttering a command.

A further object of the invention is to provide a digital computer peripheral unit, capable of interfacing with a multipurpose computing device, for establishing a telephone connection in response to commands received from the computing device.

Yet another object of the present invention is to provide a novel voice dialing system that utilizes the resources of a multi-purpose computing device that is interfaced with a peripheral unit effecting the actual call establishment functions.

As embodied and broadly described herein, the invention provides a computer readable storage medium containing a program element for use with a computing device coupled to a voice path switching device, the program element being operative to direct the
computer to control the voice path switching device, the voice path switching device including:
- a first input for receiving a speech signal;
- a second input for connection to the computer;
- a first output for connection to a telephone network;
- a second output for connection to the computer,
the voice path switching device being capable of acquiring two operative modes, namely a first operative mode in which a speech signal received at the first input is transmitted to the first output and a second operative mode in which a speech signal received at the first input is transmitted to the second output for processing by the computer, the voice path switching device being responsive to a task specification signal received at the second input from the computer to output from the first output a telephone network control message;
a computer including:
- memory means including:
a) a dictionary containing a plurality of vocabulary items potentially recognizable on a basis of a spoken utterance;
b) a plurality of task data elements associated with respective vocabulary items;
- processor means in operative relationship with said memory means, said program element instructing said processor means to:
a) receive a signal representative of a spoken utterance from the voice path switching device;
b) search the dictionary for a vocabulary item potentially matching the spoken utterance;
c) retrieve the task data element associated to the vocabulary item potentially matching the spoken utterance;
d) generate a task specification signal on a basis of the task data element associated to the vocabulary item potentially matching the spoken utterance;
e) transmit the task specification signal to the voice path switching device to permit the voice path switching device to generate a telephone network control signal over the first output;
f) generate a control message for the voice path switching device for causing the voice path switching device to acquire the
first operative mode;
g) transmit the control message to the voice path switching device,
whereby the telephone network control signal enables establishment of a telephone connection and a speech signal received at the first input of the voice path switching device is transmitted to the first output of the voice path switching device.

For the purpose of this specification, the expression "word" designates a textual representation of a spoken utterance. In a specific example, a textual representation is a collection of written symbols or characters that are components of an alphabet.

For the purpose of this specification, the expressions "orthographic representation" and "orthography" are used interchangeably. An "orthography" is a data element in a machine readable form that is an electronic representation of a word. Typically, an orthography is a collection of symbols mapped to the characters forming the word. The expression "orthography" also includes data structures including solely or in part pointers or links to locations, such as in a memory for example, that contain the actual representation of the word.

For the purpose of this specification, the expression "utterance" is a sound or combination of sounds that form a meaningful linguistic unit.

For the purpose of this specification, the expression "transcription" is used to designate a machine readable data element that is a combination of symbols providing information on how a sub-word unit, such as a letter or a combination of letters, may be pronounced. For example, a simple word like "a" is often pronounced as /ey/ in isolation but as /ax/ in context. Another example is a word like "data", which can be pronounced as /d ey t ax/ or /d ae t ax/ depending on the speaker's dialect. Typically a word may have several transcriptions, where each transcription may be a different pronunciation of the word.
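As an illustration of the definitions above, the relationship between an orthography and its several transcriptions can be sketched as a small lookup structure. This is a minimal sketch only: the class name `LexiconEntry` and the helper functions are hypothetical and do not appear in the specification, and the phoneme symbols are simply those quoted in the examples for "a" and "data".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LexiconEntry:
    """One orthography together with its alternative transcriptions (hypothetical type)."""
    orthography: str
    transcriptions: List[Tuple[str, ...]] = field(default_factory=list)


def build_lexicon(entries: List[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Index entries by lower-cased orthography so lookup ignores case."""
    return {entry.orthography.lower(): entry for entry in entries}


def transcriptions_for(word: str, lexicon: Dict[str, LexiconEntry]) -> List[Tuple[str, ...]]:
    """Return every known transcription of a word, or an empty list if unknown."""
    entry = lexicon.get(word.lower())
    return entry.transcriptions if entry else []


# The two example words from the text, each with two pronunciations.
lexicon = build_lexicon([
    LexiconEntry("a", [("ey",), ("ax",)]),
    LexiconEntry("data", [("d", "ey", "t", "ax"), ("d", "ae", "t", "ax")]),
])
```

Keeping the transcriptions as a list per orthography reflects the statement that a word may have several pronunciations; a recognizer would be free to score each of them.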
The symbols that compose a transcription may be phonemes, allophones, triphones, syllables and dyads (demi-syllables). Although the definition of "transcription" herein refers to a data structure including symbols, it should be understood that what is meant is a data element having constituent parts that are representations of the symbols. The table below illustrates examples of words (or orthographies) and the associated transcriptions.

KEYWORD        TRANSCRIPTION
"COMPOSE"      < kl*4m6-p1o2z0>
"SLOW DOWN"    < s819o3 dlawOnlS>

In the above table, each alphanumeric character in the transcription is an allophone. The character is a phoneme and the digits following the character indicate a variation of the phoneme in a certain acoustic context (allophone). The "-" character is the inter-word silence and "-" is the syllabic mark.

For the purpose of this specification, the expressions "model" and "speech recognition model" are used to designate a mathematical representation of a speech signal. Speech modeling is well known in the art of speech recognition. Commonly used models in speech recognition are Hidden Markov Models (HMMs), where each phoneme can be represented by a sequence of states and transitions between these states. For basic information about HMMs the reader is invited to consult "An Introduction to Hidden Markov Models", L.R. Rabiner and B. H. Juang, IEEE ASSP Magazine, January 1986, pp. 4-16, whose content is hereby incorporated by reference.

In a specific example, the invention is embodied in a personal voice dialing system. In a most preferred embodiment of the invention, the Personal Voice Dialer (PVD) system provides the user with the ability to dial a telephone number simply by saying the name of the called party. Structurally speaking, the PVD system includes a computer to which is coupled a voice path switching device.
The voice path switching device is a peripheral that is controlled by the computer to permit establishment of a telephone connection. Most preferably, the voice path switching device is integrated in a CPE or any other suitable telephone instrument. Such telephone instrument can be fixed or mobile. The voice path switching device is preferably coupled to the computer through a Universal Serial Bus (USB) connection. Such USB connection allows a high rate data transfer and has been found to be superior to alternative connection schemes, such as schemes based on serial port communication. The voice path switching device receives a speech signal and can selectively direct the signal either to the computer for processing or to the telephone network. In addition, the voice path switching device can receive task specification signals from the computer and issue telephone network control signals.

The computer is provided with a program element that controls the operation of the PVD system. The program element can receive a spoken utterance from the voice path switching device (typically, such spoken utterance is the name of an entity the user wishes to call), perform speech recognition in a dictionary to recognize the spoken utterance and retrieve the telephone number associated with that entity, and issue a task assignment signal that directs the voice path switching device to output the appropriate telephone network control message. In a specific example, the task assignment signal includes the telephone number of the party to be called. The telephone network control message produced by the voice path switching device is the actual dialing signal to establish the telephone connection.

Consider a more specific example where the voice path switching device is integrated in a CPE. The user lifts the handset or activates the hands-free microphone of the CPE. The voice path switching device sends a message to the program element in the computer so as to activate it.
In response to this message, the program element instructs the voice path switching unit to acquire an operative mode such that the voice path from the microphone is directed to the computer rather than the telephone line. The program element then causes generation of a prompt to the user. This prompt is a signal that travels over the USB connection. The prompt may be a sentence instructing the user to speak, an earcon indicating that the system is waiting for an input or another type of prompt. Alternatively, the prompt may be solely visual, such as a display on a video screen. Following the prompt, the program element monitors the USB connection for incoming data. The user formulates his request by speaking into the microphone of the CPE. In this example, the spoken utterance is representative of the entity to be called. As the user speaks the name of a person, the utterance is converted into a signal format suitable for processing by the program element. Preferably, the transmitted signal is a digital signal produced by well-known methods of speech coding. The program element receives the coded voice signal and processes it. This processing includes passing on the coded voice signal to the speech recognition unit. If the speech recognition attempt was successful, the program element retrieves the phone number associated with the top candidate and generates the task specification signal that contains the textual representation of the top name (orthography), its associated phone number and, optionally, the name of a file containing the audio playback associated with the recognized name. This task specification signal is passed over the USB link to the CPE. The task specification signal is processed and the top recognition result (textual representation of the top name) is presented to the user on a display for confirmation, in order to allow the user to terminate the call if the recognition result is not what the user said.
The presentation of the results to the user is preferable to ensure that the user can hang up in the case of an incorrect recognition. The user receives the recognition result and is given a certain pre-determined delay time to reject the result. In a specific example, the user may reject a recognition result by hanging up the phone set. After the pre-determined delay time, the system dials the phone number included in the task specification signal. This can be effected by sending a confirmation command to the CPE. The program element also sends over the USB link a control message to cause the voice path switching device to acquire a different mode of operation where the speech path is routed toward the telephone line, rather than the computer. This is effected so as to allow establishment of a speech path with the called party.

Optionally, the program element stores recognition results that have been rejected by the user. Preferably, the program element stores the rejected name and the time when the rejection occurred. If during a subsequent attempt within a given time interval the speech recognition unit returns the same name as best match, the program element selects the second best match or another match to avoid giving to the user a name he has already rejected. In the preferred embodiment, the time interval is set to 30 seconds.

In a preferred embodiment, the speech recognition dictionary used by the program element is generated on the basis of the orthography of the words in a directory listing. In a preferred embodiment the speech recognition dictionary interacts with a speechware builder unit. The speechware builder unit processes entries in the directory listing stored on a computer readable storage unit to create entries for the speech recognition dictionary. In a preferred embodiment, the user manually creates the directory listing.
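The optional handling of rejected results described above (store the rejected name and the time of rejection, then skip that name if it comes back as best match within the time interval) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the class name `RejectionMemory` and its methods are hypothetical, the 30 second default comes from the preferred embodiment, and falling back to the top candidate when every candidate was recently rejected is an assumption the text does not address.

```python
import time


class RejectionMemory:
    """Remembers recently rejected names so they are not offered again (hypothetical helper)."""

    def __init__(self, window_seconds=30.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # 30 s per the preferred embodiment
        self.clock = clock                    # injectable clock, useful for testing
        self._rejected_at = {}                # name -> time at which it was rejected

    def reject(self, name):
        """Record that the user rejected this recognition result now."""
        self._rejected_at[name] = self.clock()

    def choose(self, candidates):
        """Return the best candidate that was not rejected within the window.

        `candidates` is ordered best match first. If all candidates were
        recently rejected, the top one is returned (an assumption).
        """
        now = self.clock()
        for name in candidates:
            rejected_at = self._rejected_at.get(name)
            if rejected_at is None or now - rejected_at > self.window_seconds:
                return name
        return candidates[0] if candidates else None
```

For example, after `memory.reject("John Hill")`, a call to `memory.choose(["John Hill", "Joan Hall"])` made within the window would select the second name rather than repeating the rejected one.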
Preferably, a single instance of each name is present in the directory and full names (including the family name) are used in the directory. Short names and nicknames may also be used depending on the recognition accuracy of the speech recognition unit. The speechware builder unit extracts from the directory listing the elements to be recognized and creates an entry in the speech recognition dictionary. These entries are also referred to as speechware. Preferably, the speechware builder unit automatically takes into account common nicknames (e.g. Dave for David) and expands common abbreviations (e.g. Saint for St.). The entries created by the speechware builder unit include a phonetic representation of the names associated to models of sounds that are used to recognize the user's utterance - the speech models. The characteristics of the speech models may differ without detracting from the spirit of the invention.

In a preferred embodiment, the entries in the speech recognition dictionary are created by using a text to transcription module to process entries in the directory listing in order to obtain a plurality of transcriptions using information derived from text to phoneme rules. The transcriptions generated by the text to transcription module are then stored on a computer readable storage medium containing the speech recognition dictionary.

In a preferred embodiment, the speechware builder unit may be provided with an update function to update the speech recognition dictionary. In a typical interaction, the user may modify the contents of the directory listing by adding a new entry, removing an entry or modifying an already existing entry. The user may then request the update of the speech recognition dictionary. The orthographies of the new entries are extracted and the speechware builder unit applies the algorithm to create speechware. Finally, once the speechware builder is finished, the newly created entries are stored in the speech recognition dictionary.
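The builder and update steps described above can be sketched roughly as follows. Everything named here is hypothetical and chosen for illustration: the tiny `NICKNAMES` and `ABBREVIATIONS` tables only hold the two examples given in the text, and `naive_transcription` is a stand-in that emits one symbol per letter, since the actual text to phoneme rules are not specified in this passage.

```python
# Assumed sample tables; a real system would carry far larger ones.
NICKNAMES = {"david": ["dave"]}
ABBREVIATIONS = {"st": "saint"}


def expand_abbreviations(name):
    """Lower-case the name and replace known abbreviations word by word."""
    words = []
    for word in name.lower().split():
        key = word.rstrip(".")
        words.append(ABBREVIATIONS.get(key, key))
    return " ".join(words)


def name_variants(name):
    """Return the expanded name plus one variant per known nickname of the first name."""
    base = expand_abbreviations(name)
    variants = [base]
    first, _, rest = base.partition(" ")
    for nickname in NICKNAMES.get(first, []):
        variants.append((nickname + " " + rest).strip())
    return variants


def naive_transcription(text):
    """Placeholder for the text to transcription module: one symbol per letter."""
    return tuple(text.replace(" ", ""))


def build_speechware(directory, dictionary=None):
    """Bring the speech recognition dictionary in line with the directory listing.

    `directory` maps a name to a phone number. Entries whose names were
    removed from the directory are dropped, existing entries are kept,
    and only new names are processed.
    """
    updated = {n: t for n, t in (dictionary or {}).items() if n in directory}
    for name in directory:
        if name not in updated:
            updated[name] = [naive_transcription(v) for v in name_variants(name)]
    return updated
```

Calling `build_speechware` with the previous dictionary as second argument mimics the update function: only orthographies that are new to the directory listing go through the transcription step.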
Optionally, the speechware builder unit may update the speech recognition dictionary every time the directory listing is modified.

The invention could be embodied in other forms and perform other tasks besides dialing telephone numbers based on spoken utterances. It could be used for other communication management tasks such as transferring calls, establishing conferences, managing e-mail, retrieving web based information or controlling house appliances using X10 commands. Alternative embodiments could operate remotely, so the user could call his phone from a distance, get access to his contact data and use the speech recognition facility. The invention may also be adapted for multi-user environments. For example, PVD would be running on a server PC to provide small and large corporations with services such as Corporate Name Dialing or Centralized PVD.

As embodied and broadly described herein, the invention further provides a voice path switching device comprising:
- a first input for receiving a speech signal;
- a second input for connection to a computing unit;
- a first output for connection to a telephone network;
- a second output for connection to the computing unit,
the voice path switching device being capable of acquiring two operative modes, namely a first operative mode in which a speech signal received at the first input is transmitted to the first output and a second operative mode in which a speech signal received at the first input is transmitted to the second output for processing by the computer, the voice path switching device being responsive to a task specification signal received at the second input from the computing unit to output from the first output a telephone network control message, and the voice path switching device being responsive to a control message whereby the operative mode is selected.
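The two operative modes of the voice path switching device recited above can be modelled as a small state machine. The sketch below is a simplified illustration only: the names `Mode` and `VoicePathSwitch` are hypothetical, the two outputs are modelled as plain lists, and starting in the first operative mode (voice routed to the telephone network) is an assumption of this sketch.

```python
from enum import Enum


class Mode(Enum):
    TO_NETWORK = 1   # first operative mode: speech goes to the first output
    TO_COMPUTER = 2  # second operative mode: speech goes to the second output


class VoicePathSwitch:
    """Toy model of the voice path switching device (hypothetical class)."""

    def __init__(self):
        self.mode = Mode.TO_NETWORK  # assumed starting mode
        self.network_out = []        # stands in for the first output
        self.computer_out = []       # stands in for the second output

    def on_control_message(self, mode):
        """Control message from the computer selects the operative mode."""
        self.mode = mode

    def on_speech(self, frame):
        """Route a speech frame from the first input according to the mode."""
        if self.mode is Mode.TO_NETWORK:
            self.network_out.append(("speech", frame))
        else:
            self.computer_out.append(("speech", frame))

    def on_task_specification(self, phone_number):
        """Task specification signal produces a network control message (dialing)."""
        self.network_out.append(("dial", phone_number))
```

In a dialing sequence the computer would first select `Mode.TO_COMPUTER` to capture the spoken name, then send the task specification and switch back to `Mode.TO_NETWORK` so that subsequent speech reaches the called party.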
As embodied and broadly described herein, the invention further provides a personal voice dialing system including in combination the voice path switching device and the computer running the program element described above.

Brief description of the drawings

These and other features of the present invention will become apparent from the following detailed description considered in connection with the accompanying drawings. It is to be understood, however, that the drawings are provided for purposes of illustration only and not as a definition of the boundaries of the invention, for which reference should be made to the appended claims.

Fig. 1 shows a simplified functional block diagram of a personal voice dialler system in accordance with an embodiment of the invention;
Fig. 2 shows a simplified diagram of the voice path switching device in accordance with the spirit of the invention;
Fig. 3a and Fig. 3b show embodiments of a speechware builder unit;
Fig. 4 shows the communication flow between modules of an embodiment of the voice dialing system;
Fig. 5 shows a flow diagram of the process of updating the speech recognition dictionary;
Fig. 6 and Fig. 7 show specific embodiments of the invention.

Description of a preferred embodiment

In a most preferred embodiment of the invention, the Personal Voice Dialer (PVD) system provides the user with the ability to dial a telephone number simply by saying the name of the called party. As shown in figure 1, such a system generally comprises a number of functional modules, namely a voice path switching device 100 that could be part of a CPE. The voice path switching device includes an input for receiving a speech signal (from the microphone of the handset or the hands-free microphone) and two outputs. One of the outputs connects with the telephone network 102 over a telephone line. The second output is the PVD connection with the computer (to be described in detail later).
Most preferably, the PVD connection is established over a USB link. The PVD connection is a bi-directional path. This path also forms the second input of the voice path switching device 100. This second input receives control signals issued by the computer that determine which functions are to be executed by the voice path switching device. The computer system executes a program element that includes a number of functional blocks, namely a speech recognizer unit 106, a personal voice dialer management unit 104 and an information retriever unit 114. The Personal Voice Dialer (PVD) system uses information stored on a computer readable medium containing a speech recognition dictionary 110. Additionally, the Personal Voice Dialer (PVD) system may further comprise a speechware builder unit 108 and a computer readable medium containing a directory listing 112.

In a typical interaction as shown in figure 4, the user lifts the handset or microphone 406 of the CPE. The voice path switching device 100 sends a message 408 to the Personal Voice Dialer management unit 104 over the PVD connection. The user hears the dial tone 410 and invokes 412 the Personal Voice Dialer management unit 104, herein referred to simply as PVD management unit. To this effect, the voice path switching device 100 sends a message 414 to instruct the PVD management unit to become active. In response to this message 414, the PVD management unit 104 instructs the voice path switching device 100 to re-route the voice path from the telephone line to the PVD connection. The user then ceases to hear the dial tone 418. This procedure of re-routing the voice path can better be understood by referring to figure 2 of the drawings. In a simplified form, in the default mode of operation of a voice path switching device, voice data is transmitted from the handset or microphone to the telephone line.
The voice path switching device 100 of the present invention transports two types of information, namely PVD control and data signals 200 and voice data 202. When the PVD management unit is activated, the voice path switching device, under the instructions of the PVD management unit, acquires a different mode of operation in which it redirects the voice path to the PVD connection.

In a most preferred embodiment, the user invokes the PVD by pressing a button or key on the voice path switching device, by entering a code using a keyboard or by any other suitable means. Alternatively, the PVD management unit may be activated by default when the handset is taken off hook, and thus no action is required on the user's side to activate the PVD.

The PVD management unit 104 then issues a prompt to the user 420. The prompt may be a sentence instructing the user to speak, an earcon indicating that the system is waiting for an input or another type of prompt. An earcon is an auditory signal intended to convey some information about a type of desired information. Earcons are well known in the field of human-computer interaction. Alternatively, the prompt may be solely visual, such as a display on a video screen provided on the CPE and controlled by the voice path switching device. Following the prompt 420, the PVD management unit 104 monitors the PVD communication channel over the USB link for incoming data.

The user receives the prompt 422 and then formulates his request by speaking 424 into the microphone or telephone set. In a preferred embodiment, the speech formulated by the user, herein designated as spoken utterance, is representative of the name of the entity to be called. Optionally, the system may allow the user to specify a particular location for the entity he wishes to reach. For example, the user may say "Peter Jones business" or "Marie Adams home".
As the user speaks the name of a person, the utterance is converted into a signal suitable for transmission over the PVD channel. Preferably, the transmitted signal is a digital signal coded using well-known methods of speech coding. The speech coding function may be effected by the voice path switching device or externally on the speech source side. Under the latter embodiment, the speech source input of the voice path switching device receives encoded speech data. The PVD management unit 104 receives the coded voice signal and processes it 426. In a preferred embodiment, the PVD management unit 104 passes on the coded voice signal to the speech recognition unit 106 of the system.

The speech recognition unit 106 receives a request for recognition from the PVD management unit 104. It then starts processing the speech data. In a specific example, the data is processed synchronously as the data arrives. The speech recognition unit 106 returns the result of the search. In a specific example, the speech recognition unit 106 returns two entries that best match the user's utterance. In a preferred embodiment, if the speech recognition procedure fails and the number of failures in the current connection attempt is less than a maximum allowable number of attempts, the PVD returns to step 420 and prompts the user to repeat the request. If the recognition attempt fails and the maximum allowable number of attempts is exceeded, the PVD process ends, preferably issuing an audio message advising the user that the recognition has failed or issuing an earcon characteristic of a recognition failure. If the speech recognition attempt was successful, the information retriever unit 114 locates in the directory listing 112 the task data element, which in this example is the phone number associated with the top recognized name.
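The prompt, recognize and retry flow described above can be summarized in a short control loop. This is a sketch under assumptions rather than the described implementation: the function name `dial_by_voice`, its callback parameters and the default of three attempts are all hypothetical (the text only speaks of a maximum allowable number of attempts), and the recognizer is a stub that returns candidates best match first, or an empty list on failure.

```python
def dial_by_voice(get_utterance, recognize, directory, max_attempts=3):
    """Prompt and recognize until success or until the attempts run out.

    get_utterance: callable that prompts the user (step 420) and returns speech data
    recognize:     callable returning candidate names, best first ([] on failure)
    directory:     mapping of name -> phone number (the task data element)
    Returns (name, phone_number) on success, or None when recognition failed.
    """
    for _ in range(max_attempts):
        utterance = get_utterance()
        candidates = recognize(utterance)
        if candidates:
            top_name = candidates[0]
            return top_name, directory.get(top_name)
    # Here the described system would issue a failure message or earcon.
    return None


def stub_recognize(utterance):
    """Stand-in recognizer: knows one utterance and returns two best matches."""
    if utterance == "john hill":
        return ["John Hill", "Joan Hall"]
    return []
```

With `stub_recognize`, a first unintelligible utterance followed by "john hill" leads to one re-prompt and then a successful lookup of the number stored for "John Hill".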
The speech recognition unit 106 passes on to the PVD management unit 104 a task specification signal containing the textual representation of the top name (orthography), its associated phone number and, optionally, the name of a file containing an audio playback associated with the recognized name. The task specification signal is delivered to the voice path switching device over the PVD connection. Optionally, the top recognition result is presented to the user on a display controlled by the voice path switching device for confirmation 428, in order to allow the user to terminate the call if the recognition result is not what the user said. The presentation of the results to the user is preferable to ensure that the user can hang up in the case of an incorrect recognition. In a preferred embodiment, the system does not wait for user confirmation but rather waits a pre-determined amount of time to give the user a chance to reject the recognized name. If the system is achieving a high rate of recognition, it might be preferable to avoid the delay of a confirmation.

The recognition result may be presented in a number of ways. If an audio file is provided, the contents of the file may be played for the user. In a specific example, the audio files are simple WAV files that the user can produce using a program such as Windows Sound Recorder running on a multi-purpose digital computer. Another alternative is to make use of text-to-speech technology in order to convert the orthography of the recognized name into an audio signal. There are many methods of text to speech synthesis that may be used in the context of this invention.

The user receives the recognition result 430 and is given a certain pre-determined delay time to reject the result. Next, the PVD management unit 104 sends a control message to the voice path switching device 100 to re-route the voice path 432 from the PVD communication channel to the telephone line.
The PVD management unit 104 effects dialing of the phone number by sending an additional command to the voice path switching device 100 that is the end of the task specification signal. Accordingly, the task specification signal includes in this example at least two components, namely the telephone number and a confirmation to dial that number. These components are sent at different times, separated by the control message. In a possible variant, the confirmation may be omitted. Thus the task specification signal includes at least one component, which is the telephone number to be dialed. After dialing, the PVD management unit 104 relinquishes control of the call. When the user hangs up 436, the voice path switching device 100 sends a message 438 to the PVD management unit 104 to indicate that the connection has been terminated. Optionally, if a PIM is running when the PVD has given up control of the call, the PIM detects that a call is in progress and takes over control of the call.

In the above description, the step of prompting the user 420 can be done in a number of ways. In a specific embodiment, the prompt is of type "single no prompt mode". The user specifies only a name, or a name and a location, in a single utterance - "John Hill" or "John Hill mobile" - to get connected to the associated phone number. In this mode the user goes off-hook, activates the PVD, gets a short earcon (a beep), says a name and a location, receives the recognition result and gets connected (or hangs up in case of a recognition error). Preferably, a default location is provided and used if the user does not specify the location he desires. In a specific example, a request for "John Hill" would be connected to the home phone number if the Home number is designated as the default.

In another specific example, the prompt is "dual mode with prompt". This mode involves dividing the prompt into a two-step operation using two distinct utterances, one for the name and one for the location.
For example, the dialog would be:
PVD prompt: "Name"
User: "John Hill"
PVD prompt: "Location"
User: "Mobile"
Preferably, during the prompt for location, the PVD management unit 104 presents the choices of locations for the recognized name.
If the user has entered a Business phone number and a Home phone number, the PVD management unit 104 may present "Business, Home or Default" as options to the user. These messages may be displayed on the screen of the personal computer on which the PVD management unit 104 is operating and/or on the voice path switching device 100 if the latter comprises a display screen. Preferably, if the recognized name has only one number specified in the list, then there is no question pertaining to location. In yet another specific example, the prompt is "dual mode no prompts". In this prompting mode, the dialing process is accelerated by replacing the prompts for name and location by earcons only.

Optionally, the PVD management unit 104 stores the recognition attempts that have been rejected by the user and the time when each rejection occurred. If, during a subsequent attempt within a given time interval, the recognition unit 106 returns the same name as best match, the PVD management unit 104 may select the second best match or another match to avoid giving the user a name he has already rejected. In the preferred embodiment, the time interval is set to a fixed number of seconds.

The speech recognition unit 106 handles the speech recognition task. The speech recognition unit 106 receives from the PVD management unit 104 the speech signal to be processed. In a specific example, a pre-processing unit in the speech recognition unit 106 converts the signal representative of the utterance into a sequence of feature vectors or other suitable representation. For example, mel-based cepstral parameters may be used to compose the feature vectors. Feature vectors are well known in the art to which this invention pertains. A search unit implementing a search algorithm then processes the feature vectors.
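The optional handling of rejected recognition attempts described above could be sketched as follows. This is a hedged illustration, not the specification's implementation: the class name `RejectionMemory`, its methods, and the 30-second window used in the usage note are assumptions, and times are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List, Optional


class RejectionMemory:
    """Remembers rejected names and when each rejection occurred."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s                # hypothetical time interval
        self._rejected: Dict[str, float] = {}   # name -> time of rejection

    def reject(self, name: str, now: float) -> None:
        self._rejected[name] = now

    def choose(self, candidates: List[str], now: float) -> Optional[str]:
        """Pick the best-ranked candidate not rejected within the window.

        `candidates` is ordered from best match to worst. If every
        candidate was recently rejected, fall back to the best match.
        """
        for name in candidates:
            rejected_at = self._rejected.get(name)
            if rejected_at is None or now - rejected_at > self.window_s:
                return name
        return candidates[0] if candidates else None
```

For instance, with `memory = RejectionMemory(window_s=30.0)` and "John Hill" rejected at time 100, a new attempt at time 110 that ranks "John Hill" first and "Joan Hall" second would yield "Joan Hall".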
The search unit scores entries in the speech recognition dictionary 110 and selects potential matches to the spoken utterance, herein referred to as recognition candidates. Any suitable search algorithm may be used here without detracting from the spirit of the invention. In a specific example, the search unit implements the two-pass search algorithm described in U.S. patent 5,515,475, Gupta et al., "Speech Recognition method using a two-pass search", May 7, 1996, whose content is hereby incorporated by reference. Preferably, the recognizer unit is capable of dealing automatically with affixes, such as hesitations before speaking. The operation of speech recognition units is well known in the art to which this invention pertains. For more information about speech recognizers, the reader is invited to consult the following patents and articles, whose contents are hereby incorporated by reference.

U.S. PATENTS
PATENT # | INVENTOR
5,488,652 | Gregory, J. Bielby et al.
4,164,025 | Dubnowski et al.
4,751,737 | Gerson et al.
4,797,910 | Daudelin
4,959,855 | Daudelin
4,979,206 | Padden et al.
5,050,215 | Nishimura
5,052,038 | Shepard
5,091,947 | Ariyoshi et al.
5,097,509 | Lennig
5,127,055 | Larkey
5,163,083 | Dowden et al.
5,181,237 | Dowden
5,204,894 | Darden
5,274,695 | Green
5,307,444 | Tsuboka
5,086,479 | Takenaga et al.

OTHER ART
TITLE | AUTHOR | SOURCE
Dynamic Adaptation of Hidden Markov Model for Robust Speech Recognition | | 1989 IEEE International Symposium on Circuits and Systems, vol. 2, May 1989, pp. 1336-1339
Unleashing The Potential of Human-To-Machine Communication | Labov and Lennig | Telesis, Issue 97, 1993, pp. 23-27
An Introduction To Hidden Markov Models | Rabiner and Juang | IEEE ASSP Magazine, Jan. 1986, pp. 4-16
Putting Speech Recognition to Work in The Telephone Network | Lennig | Computer, published by IEEE Computer Society, vol. 23, No. 8, Aug. 1990

In a preferred embodiment, the speech recognition unit 106 interacts with a speech recognition dictionary 110 stored on a machine-readable storage medium. The speech recognition dictionary 110 stores a set of vocabulary items, such as labels and transcriptions, that are potentially recognizable on the basis of a spoken utterance. The transcriptions in the dictionary are data elements made up of a combination of symbols providing information on how a sub-word unit, such as a letter or a combination of letters, may be pronounced. The speech recognition unit 106 tries to match the detected speech signal with entries in the speech recognition dictionary 110 and selects the entry that is the most likely to be what the user is saying. In a preferred embodiment, the speech recognition dictionary 110 is created by a speechware builder module 108 on the basis of entries in the directory listing storage unit 112. The directory listing storage unit 112 typically comprises the words the speech recognition unit 106 should recognize. The speech recognition dictionary 110 created on the basis of the directory listing storage unit 112 may be speaker dependent or independent without detracting from the spirit of the invention. Preferably, each transcription in the speech recognition dictionary 110 is associated with a label such as an orthography. In a preferred embodiment, each label in the speech recognition dictionary 110 is in turn associated with a link that allows the desired action associated with the label to be completed when a speech recognition system selects a transcription associated with the label. In the preferred embodiment, the link may be a pointer to a data structure in the directory listing containing a task data element. In this example, the task data element contains the telephone number of the entity associated with the label. The information retriever unit 114 performs the linking process.
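The chain from transcription to label to link to task data element can be illustrated with a small Python sketch. The class names, the `lookup` helper, the string key used as a link and the phone number are all hypothetical; only the transcription strings for "Brault" are borrowed from the specification's own example table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskDataElement:
    phone_number: str            # data needed to carry out the task


@dataclass
class DictionaryEntry:
    label: str                   # orthography of the vocabulary item
    transcriptions: List[str]    # possible pronunciations of the label
    link: str                    # key of the task data element in the directory


def lookup(
    selected: str,
    dictionary: List[DictionaryEntry],
    directory: Dict[str, TaskDataElement],
) -> Optional[Tuple[str, str]]:
    """Follow the link of the entry owning the selected transcription."""
    for entry in dictionary:
        if selected in entry.transcriptions:
            task = directory[entry.link]
            return entry.label, task.phone_number
    return None


# Hypothetical contents: one directory record and one dictionary entry.
DIRECTORY = {"brault": TaskDataElement("555 0100")}
DICTIONARY = [DictionaryEntry("Brault", ["bro", "brOlt"], "brault")]
```

In this sketch a plain dictionary key stands in for the pointer mentioned in the text; selecting either transcription of "Brault" leads to the same task data element.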
The speech recognition unit 106 then returns to the PVD management unit 104 the top orthographies recognized along with their associated telephone numbers. The link may also be an action link that identifies an action to be taken. For example, in a system designed to effect a certain procedure in response to a spoken command, the link designates the action to be performed. This designation may be direct, in which case the link contains the information identifying the action, or indirect, where the link points to a location containing the information identifying the action. In a specific example, the system may be used to operate components on the basis of spoken commands. For instance, the user may say "lights on" to indicate that the lighting in a room is to be activated. The action link in this case identifies the specific action to be taken. The link may be a data element that contains the information to the effect that the lights should be activated, or it may be a pointer to a table containing the entire list of possible actions that the system can effect, the pointer identifying the entry in the table corresponding to the light activation. Thus, for the purposes of this specification, the expression "link" should be interpreted in a broad manner to designate the structure that constitutes an operative relationship between a label and the desired action that is to be performed when a transcription associated with the label is selected by the speech recognition system as a likely match to the spoken utterance.

In a preferred embodiment, the speech recognition dictionary 110 is generated on the basis of the orthography of the words in the directory listing. Optionally, a general-purpose speech recognition dictionary supplemented by entries generated on the basis of the directory listing may be used for the recognition task. In a preferred embodiment, the speech recognition dictionary 110 interacts with the speechware builder unit 108.
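The difference between a direct and an indirect action link, as described above for the "lights on" command, can be sketched in Python as follows. The type names, the action identifiers and the table contents are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class DirectLink:
    action: str        # the link itself carries the action identifier


@dataclass(frozen=True)
class IndirectLink:
    table_index: int   # the link points at an entry in a table of actions


# Hypothetical table listing every action the system can effect.
ACTION_TABLE: List[str] = ["lights_on", "lights_off"]


def resolve_action(link: Union[DirectLink, IndirectLink], table: List[str]) -> str:
    """Return the action identifier designated by either kind of link."""
    if isinstance(link, DirectLink):
        return link.action
    return table[link.table_index]


# Two labels whose links designate actions in the two possible ways.
LINKS = {
    "lights on": IndirectLink(0),
    "lights off": DirectLink("lights_off"),
}
```

Both forms resolve to an action identifier; the indirect form keeps the list of possible actions in one place, at the cost of an extra lookup.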
The speechware builder unit 108 processes entries in the directory listing 112 stored on a computer readable storage unit to create entries for the speech recognition dictionary 110. In a preferred embodiment, the user creates the directory listing 112. Preferably, a single instance of each name is present in the directory and full names (including the family name) are used in the directory. Short names and nicknames may also be used depending on the recognition accuracy of the speech recognition unit 106. In a specific example, the directory listing 112 is a simple ASCII file in a predetermined format. This type of file can be created using any ASCII editor. In a specific example, the entries are in the following format:

Utterance:: phone no; wave file location, synonym, synonym,...

Sample entries following the above format are as follows:

John Smith::555 8225; c:ApplicationPVDwavSmith.wav, Johnny
John Smith business::555 4434; c:ApplicationPVDwavSmith.wav

Other suitable formats for the directory listing 112 may be used without detracting from the spirit of the invention. The directory listing 112 contains entries for the words the speech recognition system should be able to recognize.

In another preferred embodiment of the invention, the names and numbers are managed by an application running in a multipurpose digital computer or other suitable personal computing device. A Personal Information Manager (PIM) is one such application. Generally, PIMs manage a database 116 containing a list of contacts with their names and multiple telephone numbers. Optionally, they allow the user to make calls, put lines on hold, transfer calls or establish conferences. Any suitable PIM can be used to manage the contact list without detracting from the spirit of the invention. The PIM database 116 can be processed by a processing unit 118 to extract the directory listing 112. The extraction of data from a database of files is well known in the art of computer sciences.
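As an illustration of the ASCII directory listing format given above, the following Python sketch parses one entry into its fields. It assumes that "::" separates the utterance from the phone number, ";" separates the phone number from the wave file location, and commas separate the wave file location from any synonyms; the names `ListingEntry` and `parse_entry` are hypothetical, and the wave file path is treated as an opaque string exactly as it appears in the sample entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListingEntry:
    utterance: str
    phone_number: str
    wave_file: Optional[str]
    synonyms: List[str]


def parse_entry(line: str) -> ListingEntry:
    """Parse 'Utterance:: phone no; wave file location, synonym, ...'."""
    utterance, separator, rest = line.partition("::")
    if not separator:
        raise ValueError(f"missing '::' in directory entry: {line!r}")
    phone, _, tail = rest.partition(";")
    # First comma-separated field is the wave file, the rest are synonyms.
    fields = [field.strip() for field in tail.split(",")]
    wave_file = fields[0] or None
    synonyms = [field for field in fields[1:] if field]
    return ListingEntry(utterance.strip(), phone.strip(), wave_file, synonyms)
```

Applied to the first sample entry, this yields the utterance "John Smith", the phone number "555 8225", the wave file string as written, and the single synonym "Johnny".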
The speechware builder unit 108 extracts from the directory listing 112 the vocabulary items to be recognized and creates corresponding entries in the speech recognition dictionary 110. These entries are also referred to as speechware. Preferably, the speechware builder unit 108 automatically takes into account common nicknames (e.g. Dave for David) and expands common abbreviations (e.g. Saint for St.). The entries created by the speechware builder unit 108 include a phonetic representation of the names, associated with the models of sounds that are used to recognize the user's utterance, namely the speech models. In a specific example, the speech recognition models used in the recognition process are user independent models where the speech models are optimized for handsets. Preferably, speech models are tuned to the environment in which the recognition system operates. The models may also be adapted by the system to the users of the system using some adaptation method. Adaptation of speech models is well known in the field of speech recognition. The characteristics of the speech models may differ without detracting from the spirit of the invention.

As shown in figure 3a, the entries in the speech recognition dictionary 110 are created by using a text to transcription module 300 to process entries in the directory listing 112 in order to obtain a plurality of transcriptions using information derived from text to phoneme rules 304. The transcriptions generated by the text to transcription module 300 are then stored on a computer readable storage medium containing the speech recognition dictionary 110. The text to transcription module 300 is preferably an automatic letter to transcription generator requiring no human intervention. Preferably, the text to transcription generator 300 is configured such that it generates a certain number of transcriptions. In a specific example, the text to transcription generator 300 generates 1 to 10 transcriptions per name.
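The nickname and abbreviation handling mentioned above could look roughly like the following Python sketch. The two lookup tables contain only the examples given in the text (Dave for David, Saint for St.); the function name `expand_orthography` and the table names are assumptions, and a real builder would need much larger tables.

```python
from typing import Dict, List

# Hypothetical lookup tables seeded with the examples from the text.
NICKNAMES: Dict[str, List[str]] = {"David": ["Dave"]}
ABBREVIATIONS: Dict[str, str] = {"St": "Saint"}


def expand_orthography(name: str) -> List[str]:
    """Return the expanded name followed by its nickname variants."""
    # Expand abbreviations word by word, ignoring a trailing period.
    words = [ABBREVIATIONS.get(word.rstrip("."), word) for word in name.split()]
    variants = [" ".join(words)]
    # Add one variant per known nickname of each word.
    for index, word in enumerate(words):
        for nickname in NICKNAMES.get(word, []):
            replaced = words[:index] + [nickname] + words[index + 1:]
            variants.append(" ".join(replaced))
    return variants
```

Each returned variant would then be passed to the transcription step, so that a user saying either the full name or the nickname can be recognized.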
Variation in the number of transcriptions per orthography does not detract from the spirit of the invention. There are many types of letter to phoneme transcription methods that may be used here, such as those described in "Modeling Pronunciation Variation for ASR: Overview and Comparison of methods", Helmer Strik and Catia Cucchiarini, Workshop Modeling Pronunciation Variation, Rolduc, 4-6 May 1998; "Maximum Likelihood Modelling of Pronunciation Variation", Trym Holter and Torbjorn Svendsen, Workshop Modeling Pronunciation Variation, Rolduc, 4-6 May 1998, pp. 63-66; and "Automatic Rule-Based Generation of Word Pronunciation Networks", Nick Cremelie and Jean-Pierre Martens, ISSN 1018-4074, pp. 2459-2462, 1997, whose contents are hereby incorporated by reference. In a most preferred embodiment, text to phoneme rules 304 from many languages are used to include the variations in pronunciations across languages. In a specific example, for a voice dialing system for use in North America, three sets of letter to phoneme rules may be used for three of the widely spoken languages, namely French, English and Spanish.

In another preferred embodiment, the transcriptions in the speech recognition dictionary 110 are derived from letter to phoneme rules as well as from language dictionaries. As shown in figure 3b, the speechware builder unit 108 further comprises a language dictionary transcription module 306 and a computer readable storage medium containing a language dictionary 308. The language dictionary provides transcriptions reflecting the correct pronunciation of words in the directory listing 112. The transcriptions resulting from the language dictionary transcription module 306 are then stored in the speech recognition dictionary 110. In a most preferred embodiment, language dictionaries 308 from many languages are used to include the variations in pronunciations across languages.
In a specific example, for a personal voice dialing system for use in North America, three language dictionaries are used for three of the widely spoken languages, namely French, English and Spanish. Using transcriptions from several languages admits pronunciation variants that are phonetically illegal in a given language but are valid in another language. In a preferred embodiment, the language dictionary transcription module 306 looks up the entry corresponding to each orthography in the lexicon in the language dictionary in order to generate a transcription to add to the speech recognition dictionary 110. The table below illustrates the different transcriptions that may be generated from the language dictionaries and the letter to phoneme rules for the orthographies "Brault", "Jose" and "George" in the directory listing.

Lexicon orthography | Transcriptions (from the English dictionary, the French dictionary, the English letter to phoneme rules and the French letter to phoneme rules)
Brault | Bro bro Bro bro BrOlt brawlt brOlt bra-ylt
Jose | ho-ze Zo-ze dZo=s*Zo-zE dZozZoz dZo=z*
George | DZOrdZ DZOrdZ DZOrdZ ZOrZ dZo-ordZ ZOr-Ze dZOrZ

Transcriptions in the speech recognition dictionary may be organized in any suitable fashion without detracting from the spirit of the invention. Alternatively, a language dictionary for a single language suitable for the user of the system may be used.

Alternatively, the speech recognition dictionary 110 may be generated on the basis of spoken utterances from the user of the system using a continuous allophone recognizer. For each name in the directory listing, the user speaks the word. The spoken utterance is stored on the computer readable medium as a speech file and then processed by the continuous allophone recognizer. The continuous allophone recognizer produces the transcriptions that are then stored in the speech recognition dictionary 110.
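Combining transcriptions from several language dictionaries and several sets of letter to phoneme rules, as described above, amounts to merging the outputs of multiple sources while dropping duplicates. The Python sketch below shows one way this could be done. The helper names are hypothetical, the sources are stubbed with small lookup tables, and the assignment of each transcription string to a particular source is assumed for illustration; the limit of 10 follows the "1 to 10 transcriptions per name" example given earlier.

```python
from typing import Callable, Dict, Iterable, List

# A source maps an orthography to zero or more transcriptions.
Source = Callable[[str], Iterable[str]]


def from_table(table: Dict[str, List[str]]) -> Source:
    """Wrap a lookup table as a transcription source (stub for a real module)."""
    return lambda orthography: table.get(orthography.lower(), [])


def build_transcriptions(
    orthography: str, sources: List[Source], limit: int = 10
) -> List[str]:
    """Merge transcriptions from all sources, keeping first occurrences."""
    merged: List[str] = []
    for source in sources:
        for transcription in source(orthography):
            # Comparison is case-sensitive because case is part of the symbol set.
            if transcription not in merged:
                merged.append(transcription)
    return merged[:limit]


# Hypothetical source contents for the orthography "George".
ENGLISH_DICT = from_table({"george": ["DZOrdZ"]})
FRENCH_DICT = from_table({"george": ["ZOrZ"]})
ENGLISH_RULES = from_table({"george": ["DZOrdZ", "dZo-ordZ"]})
```

Keeping the sources as interchangeable callables means a single-language configuration is simply a shorter list of sources.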
The speech file can also be used in the case where the system requests confirmation from the user by saying the recognized name. Generating transcriptions on the basis of a spoken utterance is well known in the art of speech processing. Other methods of generating transcriptions may be used here without detracting from the spirit of the invention.

In a preferred embodiment, the speechware builder unit 108 may be requested by the PVD management unit 104 to update the speech recognition dictionary 110. In a typical interaction as shown in figure 5, the user may modify the contents of the directory listing 500 by adding a new entry, removing an entry or modifying an already existing entry. The user may then request that the PVD system update the speech recognition dictionary 502. The PVD management unit sends a message to the speechware builder unit to update the dictionary. As a first step, if the entries are stored in a PIM database, the entries are extracted and stored in the directory listing. Following this, the orthographies of the entries are extracted 504 and the speechware builder unit applies the algorithm to create speechware 506. Finally, once the speechware builder is finished, the newly created entries are stored in the speech recognition dictionary. Optionally, the speechware builder unit may update the speech recognition dictionary every time the directory listing is modified.

Figures 6 and 7 show specific embodiments of the invention. As shown in figure 6, the Personal Voice Dialer system may be implemented in a system comprising a PVD telephone instrument 704 interfacing with a telephone line 710 and a USB line 702, a processor 700 interfacing with the USB line 702 and a computer readable storage medium 708 containing program and data elements. The PVD telephone 704 is connected to a PC via a Universal Serial Bus (USB). USB is a standard intended to replace RS232 and video connections.
It provides a bandwidth of 12 Mbit/s, large enough to transport audio signals with minimal distortion. Other types of connections are possible. In the preferred embodiment, the processor 700 is running software units 708 implementing the following modules:
1. The PVD management unit 104 to manage the interface with the voice path switching device 100 for establishing calls.
2. The speech recognizer 106 to do the speech recognition.
3. The speechware builder unit 108 to transcribe the directory listing 112 into entries in the speech recognition dictionary suitable for use by the speech recognizer 106.
4. Optionally, a PIM (personal information manager) application.
In a specific example, the PVD telephone set 704 is a multimedia audio device and a TAPI device. Preferably, the PVD telephone set 704 provides a dedicated PVD button and a TAPI message to trigger PVD. The telephone set integrates the voice path switching device. This function can be implemented by using hardware or software modules. As mentioned previously, the voice path switching device effects the actual voice path switching and also the generation and issuance of the telephone network control messages. When the PVD is activated, the voice path switching device cuts off the dial tone while the user is interacting with the PVD. The PVD telephone set 704 may display the text of the name and phone number on a screen. Once the PVD interaction is complete, the PVD telephone set 704 restores the audio connection to the phone line in order to access the telephone network 706. In another specific example, the Personal Voice Dialer system may be used in a system comprising a PBX.
As shown in figure 7, the PVD system may comprise any suitable telephone set 716 interfacing with a voice path switching device module 712 using a standard POTS line, a voice path switching device module 712 interfacing with a PBX unit 714 and a USB line 702, a processor 700 interfacing with the USB line 702 and a computer readable storage medium 708 containing program and data elements. The voice path switching device 712 is connected to the processor 700 via a Universal Serial Bus (USB) and to the PBX via a POTS line. The PBX then interfaces with the telephone network to establish a connection. In this embodiment, the telephone set 716 also incorporates the voice path switching device.

The invention could be embodied in other forms and perform other tasks besides dialing names. It could be used for other tasks such as transferring calls, establishing conferences, managing e-mail, retrieving web-based information or controlling house appliances using X10 commands. Alternative embodiments could operate remotely, so that the user could call his phone from a distance, get access to his contact data and use the speech recognition facility. With remote operation there is no access to visual confirmation, so audio confirmation using voice files is preferable. To ensure that audio confirmation is available for all names, a text-to-speech system could be added to the system. In the context of remote operation, the system may make use of voice modems to receive voice data from the user and transmit it to the PVD management unit. The invention may also be adapted for multi-user environments. For example, PVD would be running on a server PC to provide small and large corporations with services such as Corporate Name Dialing or Centralized PVD.
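The behaviour of the voice path switching device used in both embodiments can be pictured as a two-mode switch. The Python sketch below is a simplified model under stated assumptions: the class and method names, the use of strings for speech frames and the "DIAL ..." message format are hypothetical, and the real device may be implemented in hardware or software.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    TELEPHONE = 1   # speech from the user goes to the telephone network
    PVD = 2         # speech from the user goes to the computer for recognition


class VoicePathSwitch:
    """Simplified model of the voice path switching device."""

    def __init__(self) -> None:
        self.mode = Mode.TELEPHONE
        self.network_out: List[str] = []    # output toward the telephone network
        self.computer_out: List[str] = []   # output toward the computer

    def on_control_message(self, mode: Mode) -> None:
        """A control message from the computer selects the operative mode."""
        self.mode = mode

    def on_speech(self, frame: str) -> None:
        """Route incoming speech according to the current mode."""
        if self.mode is Mode.PVD:
            self.computer_out.append(frame)
        else:
            self.network_out.append(frame)

    def on_task_specification(self, phone_number: str) -> None:
        """Turn a task specification into a network control message."""
        self.network_out.append(f"DIAL {phone_number}")
```

A typical session switches to `Mode.PVD` when the PVD is activated, forwards the spoken name to the computer, switches back to `Mode.TELEPHONE`, and then dials the number received in the task specification.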
Although the present invention has been described in considerable detail with reference to certain preferred embodiments thereof, variations and refinements are possible without departing from the spirit of the invention as has been described throughout the document. Therefore, the scope of the invention should be limited only by the appended claims and their equivalents.
Claims (6):
[1] 1. A computer readable storage medium containing a program element for use with a computing device coupled to a voice path switching device, the program element being operative to direct the computer to control the voice path switching device, the voice path switching device including:
- a first input for receiving a speech signal;
- a second input for connection to the computer;
- a first output for connection to a telephone network;
- a second output for connection to the computer,
the voice path switching device being capable of acquiring two operative modes, namely a first operative mode in which a speech signal received at the first input is transmitted to the first output and a second operative mode in which a speech signal received at the first input is transmitted to the second output for processing by the computer, the voice path switching device being responsive to a task specification signal received at the second input from the computer to output from the first output a telephone network control message;
a computer including:
- memory means including:
a) a dictionary containing a plurality of vocabulary items potentially recognizable on a basis of a spoken utterance;
b) a plurality of task data elements associated with respective vocabulary items;
- processor means in operative relationship with said memory means, said program element instructing said processor means to:
a) receive a signal representative of a spoken utterance from the voice path switching device;
b) search the dictionary for a vocabulary item potentially matching the spoken utterance;
c) retrieve the task data element associated to the vocabulary item potentially matching the spoken utterance;
d) generate a task specification signal on a basis of the task data element associated to the vocabulary item potentially matching the spoken utterance;
e) transmit the task specification signal to the voice path switching device to permit the voice path switching device to generate a telephone network control signal over the first output;
f) generate a control message for the voice path switching device for causing the voice path switching device to acquire the first operative mode;
g) transmit the control message to the voice path switching device, whereby the telephone network control signal enables establishment of a telephone connection and a speech signal received at the first input of the voice path switching device is transmitted to the first output of the voice path switching device.
[2] 2. A computer readable medium as defined in claim 1, wherein said task specification signal includes a telephone number to be dialled.
[3] 3. A voice path switching device comprising:
- a first input for receiving a speech signal;
- a second input for connection to a computing unit;
- a first output for connection to a telephone network;
- a second output for connection to the computing unit,
the voice path switching device being capable of acquiring two operative modes, namely a first operative mode in which a speech signal received at the first input is transmitted to the first output and a second operative mode in which a speech signal received at the first input is transmitted to the second output for processing by the computer, the voice path switching device being responsive to a task specification signal received at the second input from the computing unit to output from the first output a telephone network control message, and the voice path switching device being responsive to a control message whereby the operative mode is selected.
[4] 4. A voice path switching device as defined in claim 3, wherein the task specification signal includes a telephone number to be dialed by said voice path switching device.
[5] 5. A telephone instrument comprising the voice path switching device defined in claim 3.
[6] 6. A personal voice dialer system for allowing the user to dial a telephone number by uttering the name of a called party, said system including in combination a voice path switching device as defined in claim 3 and a computer operating according to the instructions in a program element as defined in claim 1.
Similar technologies:
Publication number | Publication date | Patent title
Cox et al.|2000|Speech and language processing for next-millennium communications services
Rabiner|1997|Applications of speech recognition in the area of telecommunications
US6601029B1|2003-07-29|Voice processing apparatus
US6327343B1|2001-12-04|System and methods for automatic call and data transfer processing
EP1047046B1|2003-07-23|Distributed architecture for training a speech recognition system
US7783475B2|2010-08-24|Menu-based, speech actuated system with speak-ahead capability
Rabiner|1994|Applications of voice processing to telecommunications
US6891932B2|2005-05-10|System and methodology for voice activated access to multiple data sources and voice repositories in a single session
US6895257B2|2005-05-17|Personalized agent for portable devices and cellular phone
US5615296A|1997-03-25|Continuous speech recognition and voice response system and method to enable conversational dialogues with microprocessors
KR100383353B1|2003-10-17|Speech recognition apparatus and method of generating vocabulary for the same
US6651042B1|2003-11-18|System and method for automatic voice message processing
US6996531B2|2006-02-07|Automated database assistance using a telephone for a speech based or text based multimedia communication mode
US6462616B1|2002-10-08|Embedded phonetic support and TTS play button in a contacts database
US5930336A|1999-07-27|Voice dialing server for branch exchange telephone systems
US20100217591A1|2010-08-26|Vowel recognition system and method in speech to text applictions
US20030149566A1|2003-08-07|System and method for a spoken language interface to a large database of changing records
JPH09186770A|1997-07-15|Method for automatic voice recognition in phone call
US5752230A|1998-05-12|Method and apparatus for identifying names with a speech recognition program
JPH07210190A|1995-08-11|Method and system for voice recognition
US6563911B2|2003-05-13|Speech enabled, automatic telephone dialer using names, including seamless interface with computer-based address book programs
GB2348035A|2000-09-20|Speech recognition system
US6671354B2|2003-12-30|Speech enabled, automatic telephone dialer using names, including seamless interface with computer-based address book programs, for telephones without private branch exchanges
US20010056345A1|2001-12-27|Method and system for speech recognition of the alphabet
Sorin et al.|1995|Operational and experimental French telecommunication services using CNET speech recognition and text-to-speech synthesis
Patent family:
Publication number | Publication date
Cited documents:
Publication number | Filing date | Publication date | Applicant | Patent title
WO2008138257A1|2007-05-14|2008-11-20|Huawei Technologies Co., Ltd.|A speech recognition device and speech communication method|
WO2013120796A1|2012-02-16|2013-08-22|Continental Automotive Gmbh|Method for phonetising a data list and speech-controlled user interface|
Legal status:
2000-12-06| EEER| Examination request|
2005-07-29| FZDE| Dead|
Priority:
Application number | Filing date | Patent title
US15210998A| true| 1998-09-14|1998-09-14||
US09/152109||1998-09-14||